Sequence comparison methodologies have evolved nnMk 
so no previously published twhuS^SSjffi 
of programs commonlv used. For examole B^mlf 
blast (1) have changei and ^L^aS^S^SL 

oi fast a (3) previously tested was 1.6 but the cun-em r^ul 

£e three most commonly used proerams Of hese ihesSJh 
Waterman algorithm (8) implemented m ssearch f3 f ^ ^ 

Sjffi,!!"" r rous Mode " ^ 

it m. ^ (1) the s P eed and convenience to make 

fast. A?° P J! la ^ r0gram ,nte ™*a,e between thL two 
■s Fast a (3). which may be run m two modes offerin. e«h£ 

proglaL C ° nS,dered diffefem Parame,ers <° r °' 'h«e 

To test the methods, Pearson selected two representative proteins from each protein superfamily defined by the PIR database (9). Each was used as a query to search the database, and the matched proteins were marked as being homologous or unrelated according to their



ABSTRACT Pairwise sequence comparison methods have been assessed using proteins whose relationships are known reliably from their structures and functions, as described in the SCOP database [Murzin, A. G., Brenner, S. E., Hubbard, T. & Chothia, C. (1995) J. Mol. Biol. 247, 536-540]. The evaluation tested the programs BLAST [Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. (1990) J. Mol. Biol. 215, 403-410], WU-BLAST2 [Altschul, S. F. & Gish, W. (1996) Methods Enzymol. 266, 460-480], FASTA [Pearson, W. R. & Lipman, D. J. (1988) Proc. Natl. Acad. Sci. USA 85, 2444-2448], and SSEARCH [Smith, T. F. & Waterman, M. S. (1981) J. Mol. Biol. 147, 195-197] and their scoring schemes. The error rate of all algorithms is greatly reduced by using statistical scores to evaluate matches rather than percentage identity or raw scores. The E-value statistical scores of SSEARCH and FASTA are reliable: the number of false positives found in our tests agrees well with the scores reported. However, the P-values reported by BLAST and WU-BLAST2 exaggerate significance by orders of magnitude. SSEARCH, FASTA ktup = 1, and WU-BLAST2 perform best, and they are capable of detecting almost all relationships between proteins whose sequence identities are >30%. For more distantly related proteins, they do much less well; only one-half of the relationships between proteins with 20-30% identity are found. Because many homologs have low sequence similarity, most distant relationships cannot be detected by any pairwise comparison method; however, those which are identified may be used with confidence.

Sequence database searching plays a role in virtually every branch of molecular biology and is crucial for interpreting the sequences issuing forth from genome projects. Given the method's central role, it is surprising that overall and relative capabilities of different procedures are largely unknown. It is difficult to verify algorithms on sample data because this requires large data sets of proteins whose evolutionary relationships are known unambiguously and independently of the methods being evaluated. However, nearly all known homologs have been identified by sequence analysis (the method to be tested). Also, it is generally very difficult to know, in the absence of structural data, whether two proteins that lack clear sequence similarity are unrelated. This has meant that although previous evaluations have helped improve sequence comparison, they have suffered from insufficient, imperfectly characterized, or artificial test data. Assessment also has been problematic because high quality database sequence searching attempts to have both sensitivity (detection of homologs) and specificity (rejection of unrelated proteins); however, these

subfamilies. Pearson found that modern matrices and "ln-scaling" of raw scores improve results considerably. He also reported that the rigorous Smith-Waterman algorithm worked slightly better than FASTA, which was in turn more effective than BLAST.

nrK"!? K ^%f a,y ? e f I 0f have been Performed 

(10), and Henikoff and Henikoff (11) also evaluated the effectiveness of BLAST and FASTA. Their test with BLAST considered the ability to detect homologs above a predetermined score but had no penalty for methods which also reported large numbers of spurious matches. The Henikoffs searched the SWISS-PROT database (12) and used PROSITE (13) to define homologous families. Their results showed that the BLOSUM62 matrix (14) performed markedly better than the PAM-series matrices (15), which had previously been popular.

A crucial aspect of any assessment is the data that are used to test the ability of the program to find homologs. But in Pearson's and the Henikoffs' evaluations of sequence comparison, the correct results were effectively unknown. This is because the superfamilies in PIR and PROSITE are principally created by using the same sequence comparison methods which are being evaluated. Interdependency of data and methods creates a "chicken and egg" problem, and means, for example, that new methods would be penalized for correctly identifying homologs missed by older programs. For instance, immunoglobulin variable and constant domains are clearly homologous, but PIR places them in different superfamilies. The problem is widespread: each superfamily in PIR 48.00 with a structural homolog is itself homologous to an average of 1.6 other PIR superfamilies (16).

To surmount these sorts of difficulties, Sander and Schneider (17) used protein structures to evaluate sequence comparison. Rather than comparing different sequence comparison algorithms, their work focused on determining a length-dependent threshold of percentage identity, above which all proteins would be of similar structure. A result of this analysis was the HSSP equation; it states that proteins with 25% identity over 80 residues will have similar structures, whereas shorter alignments require higher identity. (Other studies also have used structures (18-20), but these focused on a small number of model proteins and were principally oriented toward evaluating alignment accuracy rather than homology detection.)
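
The length dependence of such an identity threshold can be sketched in a few lines. This is only an illustration: the power-law constants below are the commonly quoted form of the Sander and Schneider curve and are an assumption here, as are the function names and the handling of very short alignments; none of them is taken from the text above.

```python
def hssp_threshold(length: int) -> float:
    """Percent identity an alignment of `length` residues must exceed.

    Assumed form: 290.15 * L**-0.562 between 10 and 80 residues,
    held constant (about 25%) for longer alignments.
    """
    if length <= 10:
        # Assumption: alignments this short can never pass the threshold.
        return 100.0
    capped = min(length, 80)  # the curve is flat beyond 80 residues
    return 290.15 * capped ** -0.562


def passes_hssp(percent_identity: float, length: int) -> bool:
    """True if the alignment lies above the length-dependent threshold."""
    return percent_identity > hssp_threshold(length)


if __name__ == "__main__":
    for n in (20, 40, 80, 150):
        print(n, round(hssp_threshold(n), 1))
```

Under these assumed constants, a 30% identical alignment passes when it spans 100 residues but not when it spans only 40, which is the behaviour the paragraph describes.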
A general solution to the problem of scoring comes from statistical measures (i.e., E-values and P-values) based on the extreme value distribution (21). Extreme value scoring was implemented analytically in the BLAST program using the Karlin and Altschul statistics (22, 23), and empirical approaches have been recently added to FASTA and SSEARCH. In addition to being heralded as a reliable means of recognizing significantly similar proteins (24, 25), the mathematical tractability of statistical scores "is a crucial feature of the BLAST algorithm" (1). The validity of this scoring procedure has been

ref. 24). However, all large empirical tests used random sequences that may lack the subtle structure found within biological sequences (26, 27) and obviously do not contain any real homologs. Thus, although many researchers have suggested that statistical scores be used to rank matches (24, 25, 28), there have been no large rigorous experiments on biological data to determine the degree to which such rankings are superior.

A Database for Testing Homology Detection. Since the discovery that the structures of hemoglobin and myoglobin are very similar though their sequences are not (29), it has been apparent that comparing structures is a more powerful (if less convenient) way to recognize distant evolutionary relationships than comparing sequences. If two proteins show a high degree of similarity in their structural details and function, it is very probable that they have an evolutionary relationship though their sequence similarity may be low.
The recent growth of protein structure information combined with the comprehensive evolutionary classification in the SCOP database has allowed us to overcome previous limitations. With these data, we can evaluate the performance of sequence comparison methods on real protein sequences whose relationships are known confidently. The SCOP database uses structural information to recognize distant homologs, the large majority of which can be determined unambiguously. These superfamilies, such as the globins or the immunoglobulins, would be recognized as related by the vast majority of the biological community despite the lack of high sequence similarity.

From SCOP we extracted the sequences of domains of proteins in the Protein Data Bank (PDB) (30) and created two databases. One (PDB90D-B) has domains which were all <90% identical to any other, whereas the other (PDB40D-B) had those <40% identical. The databases were created by first sorting all protein domains in SCOP by their quality and making a list. The highest quality domain was selected for inclusion in the database and removed from the list. Also removed from the list (and discarded) were all other domains above the threshold level of identity to the selected domain. This process was repeated until the list was empty. The PDB40D-B database contains 1,323 domains, which have 9,044 ordered pairs of distant relationships, or ≈0.5% of all ordered pairs. In PDB90D-B, the 2,079 domains have 53,988 relationships, representing 1.2% of all pairs. Low complexity regions of sequence can achieve spurious high scores, so these were masked in both databases by processing with the SEG program using recommended parameters: 12 1.8 2.0. The databases
used in this paper are available from http://sss.stanford.edu/sss/, and databases derived from the current version of SCOP may be found at http://scop.mrc-lmb.cam.ac.uk/scop/.
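Results from both databases were generally consistent, but PDB40D-B focuses on distantly related proteins and reduces the heavy overrepresentation in the PDB of a small number of families (31, 32), whereas PDB90D-B (with more sequences) improves evaluations of statistics. Except where noted otherwise, the homolog results here are from PDB40D-B. Although the precise numbers reported here are specific to the structural domain databases used, we expect the trends to be general.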
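
The list-based selection procedure used to build the two databases can be sketched as a greedy filter. This is a minimal illustration under stated assumptions: the `(name, quality)` tuples, the `identity` callback, and the convention that a larger quality value is better are all hypothetical stand-ins, since the text does not specify how quality or pairwise identity were computed.

```python
from typing import Callable, List, Tuple


def select_nonredundant(
    domains: List[Tuple[str, float]],
    identity: Callable[[str, str], float],
    threshold: float,
) -> List[str]:
    """Greedily keep the best remaining domain and discard its close relatives.

    domains   -- (name, quality) pairs; higher quality is assumed to be better
    identity  -- returns percent identity between two named domains
    threshold -- domains at or above this identity to a kept domain are dropped
    """
    remaining = sorted(domains, key=lambda d: d[1], reverse=True)
    selected: List[str] = []
    while remaining:
        best_name, _ = remaining.pop(0)      # highest quality domain left
        selected.append(best_name)
        remaining = [
            d for d in remaining
            if identity(best_name, d[0]) < threshold
        ]
    return selected


if __name__ == "__main__":
    # Toy identities, invented purely for illustration.
    table = {frozenset("ab"): 95.0, frozenset("ac"): 30.0, frozenset("bc"): 35.0}
    pid = lambda x, y: table[frozenset((x, y))]
    toy = [("a", 3.0), ("b", 2.0), ("c", 1.0)]
    print(select_nonredundant(toy, pid, 40.0))
```

With a 40% threshold the toy run keeps `a` and `c` and discards `b`, because `b` is 95% identical to the higher quality `a`.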

Assessment Data and Procedure. Our assessment of sequence comparison may be divided into four different major categories of tests. First, using just a single sequence comparison algorithm at a time, we evaluated the effectiveness of different scoring schemes. Second, we assessed the reliability of scoring procedures, including an evaluation of the validity of statistical scoring. Third, we compared sequence comparison algorithms (using the optimal scoring scheme) to determine their relative performance. Fourth, we examined the distribution of homologs and considered the power of pairwise sequence comparison to recognize them. All of the analyses used the databases of structurally identified homologs and a new assessment criterion.

The analyses tested BLAST (1), version 1.4.9MP, and WU-BLAST2 (2), version 2.0a13MP. Also assessed was the FASTA package, version 3.0t76 (3), which provided FASTA and the SSEARCH implementation of Smith-Waterman (8). For

The default parameters and matrix (BLOSUM62) were used for BLAST and WU-BLAST2.

Coverage vs. Error Plots. To test a particular protocol (comprising a program and scoring scheme), each sequence from the database was used as a query to search the database. This yielded ordered pairs of query and target sequences with associated scores, which were sorted, on the basis of their scores, from best to worst. The ideal method would have
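Fig. 1. Coverage vs. error plots of different scoring schemes for SSEARCH Smith-Waterman. All of the proteins in the database were compared with each other using the SSEARCH program. The results of this single set of comparisons were considered using five different scoring schemes and assessed. The graphs show the coverage and errors per query (EPQ) for statistical scores, raw scores, and three measures using percentage identity. In the coverage vs. error plot, the x axis indicates the fraction of all homologs in the database (known from structure) which have been detected. Precisely, it is the number of detected pairs of proteins with the same fold divided by the total number of pairs from a common superfamily. PDB40D-B contains a total of 9,044 homologs, so a score of 10% indicates identification of 904 relationships. The y axis reports the number of EPQ. Because there are 1,323 queries made in the PDB40D-B all-vs.-all comparison, 13 errors correspond to 0.01, or 1%, EPQ. The y axis is presented on a log scale to show results over the widely varying degrees of accuracy which may be desired. The scores that correspond to the levels of EPQ and coverage are shown in Fig. 4 and Table 1. The graph demonstrates the trade-off between sensitivity and selectivity. As more homologs are found (moving to the right), more errors are made (moving up). The ideal method would be in the lower right corner of the graph, which corresponds to identifying many evolutionary relationships without selecting unrelated proteins. Three measures of percentage identity are plotted. Percentage identity within alignment is the degree of identity within the aligned region of the proteins, without consideration of the alignment length. Percentage identity within both is the number of identical residues in the aligned region as a percentage of the average length of the query and target proteins. The HSSP equation (17) is H = 290.15 l^(-0.562), where l is the alignment length, for 10 < l < 80; H > 100 for l < 10; H = 24.7 for l > 80. The percentage identity HSSP-adjusted score is the percent identity within the alignment minus H. Smith-Waterman raw scores and E-values were taken directly from the sequence comparison program.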



perfect separation, with all of the homologs at the top of the list and unrelated proteins below. In practice, perfect separation is impossible to achieve, so instead one is interested in drawing a threshold above which there are the largest number of related pairs of sequences consistent with an acceptable error rate.

Our procedure involved measuring the coverage and error for every threshold. Coverage was defined as the fraction of structurally determined homologs that have scores above the selected threshold; this reflects the sensitivity of a method. Errors per query (EPQ), an indicator of selectivity, is the number of nonhomologous pairs above the threshold divided by the number of queries. Graphs of these data, called coverage vs. error plots, were devised to understand how protocols compare at different levels of accuracy.
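
The two quantities plotted against each other follow directly from a score-sorted hit list. The sketch below is an illustrative reading of those definitions, not the authors' code: the function names and the `(score, is_homolog)` input format are assumptions, higher scores are taken to be better (negate E-values or P-values before passing them in), and tied scores are not treated specially.

```python
from typing import Iterable, List, Tuple


def coverage_vs_error(
    hits: Iterable[Tuple[float, bool]],
    n_homolog_pairs: int,
    n_queries: int,
) -> List[Tuple[float, float]]:
    """Return (coverage, errors per query) after each hit, best score first.

    hits            -- (score, is_homolog) for every query/target pair reported
    n_homolog_pairs -- all structurally known homologous pairs in the database
    n_queries       -- number of query sequences searched
    """
    ordered = sorted(hits, key=lambda h: h[0], reverse=True)
    found = 0
    errors = 0
    curve = []
    for _score, is_homolog in ordered:
        if is_homolog:
            found += 1
        else:
            errors += 1
        curve.append((found / n_homolog_pairs, errors / n_queries))
    return curve


def coverage_at_epq(curve: List[Tuple[float, float]], max_epq: float) -> float:
    """Largest coverage reached without exceeding `max_epq` errors per query."""
    return max((cov for cov, epq in curve if epq <= max_epq), default=0.0)


if __name__ == "__main__":
    toy = [(50, True), (40, True), (30, False), (20, True), (10, False)]
    curve = coverage_vs_error(toy, n_homolog_pairs=4, n_queries=2)
    print(curve)
    print(coverage_at_epq(curve, 0.5))
```

Reading a protocol off at a fixed EPQ, as `coverage_at_epq` does, is one simple way to compare protocols at the same level of accuracy.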
bene" ^,^^^3^^ 34, but 

te«v ■ 'u2 « i,Zn PlaCM 3 prem,um on score consis- 
significant. Proteins rendered by RasMol (40).
Fig. 3. Percent identity of unrelated proteins (PDB90D-B). x axis: alignment length. Each pair of unrelated proteins found with SSEARCH is plotted as a point whose position gives the length and percent identity of the alignment. The line shows the HSSP threshold (though it is intended to be applied with a different matrix and parameters).
Fig. 4. Reliability of statistical scores (PDB90D-B). Each line shows the relationship between reported statistical score and actual error rate for a different program. E-values are reported for SSEARCH and FASTA, whereas P-values are shown for BLAST and WU-BLAST2. If the scoring were perfect, then the number of errors per query and the E-values would be the same, as indicated by the upper bold line. (P-values should be the same as EPQ for small numbers and diverge at higher values, as indicated by the lower bold line.) E-values from SSEARCH and FASTA are shown to have good agreement with EPQ but underestimate the significance slightly. BLAST and WU-BLAST2 are overconfident, with the degree of exaggeration dependent upon the score. The results for PDB40D-B were similar to those for PDB90D-B despite the difference in number of homologs detected. This graph could be used to roughly calibrate the reliability of a given statistical score.
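
The remark that P-values track EPQ only for small numbers follows from the usual relation between the two statistics. The sketch below assumes the standard Poisson form P = 1 - exp(-E), which is not stated in the text above; the function names are hypothetical.

```python
import math


def p_from_e(e_value: float) -> float:
    """P-value implied by an E-value, assuming P = 1 - exp(-E)."""
    return -math.expm1(-e_value)


def e_from_p(p_value: float) -> float:
    """E-value implied by a P-value under the same assumption (P < 1)."""
    return -math.log1p(-p_value)


if __name__ == "__main__":
    for e in (0.001, 0.01, 0.1, 1.0, 10.0):
        print(f"E = {e:<6} P = {p_from_e(e):.6f}")
```

For E = 0.01 the two values agree to within about half a percent, whereas for E = 10 the P-value is pinned just below 1. A perfectly calibrated P-value therefore cannot follow EPQ at high error rates.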

ignored in previous tests but is essential for the straightforward or automatic interpretation of sequence comparison results. Further, it provides a clear indication of the confidence that should be ascribed to each match. Indeed, the EPQ measure should approximate the expectation value reported by database searching programs, if the programs' estimates are accurate.

The Performance of Scoring Schemes. All of the programs
tested could provide three fundamental types of scores. The first score is the percentage identity, which may be computed in several ways based on either the length of the alignment or the lengths of the sequences. The second is a "raw" or "Smith-Waterman" score, which is the measure optimized by the Smith-Waterman algorithm and is computed by summing the substitution matrix scores for each position in the alignment and subtracting gap penalties. In BLAST, a measure related to this score is scaled into bits. Third is a statistical score based on the extreme value distribution.
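
The first two score types can be computed directly from a finished alignment. The sketch below is illustrative only: the toy match/mismatch scorer stands in for a real substitution matrix, the gap cost convention (a gap of k residues costs `gap_open + (k - 1) * gap_extend`) is an assumption because programs differ on this, and identity is taken over all alignment columns.

```python
from typing import Callable


def raw_score(
    aligned_a: str,
    aligned_b: str,
    score: Callable[[str, str], int],
    gap_open: int,
    gap_extend: int,
) -> int:
    """Sum substitution scores over aligned columns and subtract gap penalties."""
    total = 0
    gap_side = None  # which sequence the current gap run is in, if any
    for x, y in zip(aligned_a, aligned_b):
        if x == "-" or y == "-":
            side = "a" if x == "-" else "b"
            total -= gap_extend if gap_side == side else gap_open
            gap_side = side
        else:
            total += score(x, y)
            gap_side = None
    return total


def percent_identity_within_alignment(aligned_a: str, aligned_b: str) -> float:
    """Identical residue columns as a percentage of all alignment columns."""
    columns = list(zip(aligned_a, aligned_b))
    same = sum(1 for x, y in columns if x == y and x != "-")
    return 100.0 * same / len(columns)


def toy_score(x: str, y: str) -> int:
    """Hypothetical scorer: +5 for a match, -4 for a mismatch."""
    return 5 if x == y else -4


if __name__ == "__main__":
    a, b = "HEAG-AW", "HE-GKAY"
    print(raw_score(a, b, toy_score, gap_open=3, gap_extend=1))
    print(round(percent_identity_within_alignment(a, b), 1))
```

The raw score rewards conservative substitutions and penalizes gaps, while percentage identity counts only exact matches.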

Sequence Identity. Though it has been long established that percentage identity is a poor measure (35), there is a common rule-of-thumb stating that 30% identity signifies homology. Moreover, publications have indicated that 25% identity can be used as a threshold. We find that these thresholds, originally derived years ago, are not supported by present results. Despite the high identity, the raw and statistical scores for such incorrect matches are typically not significant. The principal reasons percentage identity does so poorly seem to be that it ignores information about gaps and about the conservative or radical nature of residue substitutions.
From the PDB90D-B analysis in Fig. 3, we learn that 30% identity is a reliable threshold for this database only for sequence alignments of at least 150 residues. Because one unrelated pair of proteins has 43.5% identity over 62 residues, it is probably necessary for alignments to be at least 70 residues in length before 40% is a reasonable threshold, for a database of this particular size and composition.

At a given reliability, scores based on percentage identity detect just a fraction of the distant homologs found by statistical scoring. If one measures the percentage identity in the aligned regions without consideration of alignment length, then a negligible number of distant homologs are detected. Use of the HSSP equation improves the value of percentage identity, but even this measure can find only 4% of all known homologs. In short, percentage identity discards most of the information measured in a sequence comparison.

Raw Scores. Smith-Waterman raw scores perform better than percentage identity (Fig. 1), but ln-scaling (7) provided no notable benefit in our analysis. It is necessary to be very precise when using either raw or bit scores because a change in cutoff score could yield a tenfold difference in EPQ. However, it is difficult to choose appropriate thresholds because the reliability of a bit score depends on the lengths of the proteins matched and the size of the database. Raw score thresholds also are affected by matrix and gap parameters.

Statistical Scores. Statistical scores were introduced partly to overcome the problems that arise from raw scores. This scoring scheme provides the best discrimination between homologous proteins and those which are unrelated. Most
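Fig. 5. Coverage vs. error plots of different sequence comparison algorithms. Five different sequence comparison methods are evaluated, each using statistical scores (E-values or P-values). (A) PDB40D-B database. In this analysis, the best method is the slow SSEARCH, which finds 18% of relationships at 1% EPQ. FASTA ktup = 1 and WU-BLAST2 are almost as good. (B) PDB90D-B database. The quick WU-BLAST2 program provides the best coverage at 1% EPQ on this database, although at higher levels of error it becomes slightly worse than FASTA ktup = 1 and SSEARCH.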
*!£!ZS^2£?* 10 of more 

details about ,he seq^ncV e„!£ 1^°^ bw a,K> hM 
«aled appropriately 8 and «"»P«"ion and is 

slightly conservative estimate of the changes of ,Z J ' 

cess asrsstSr? » 

best nJZ ^^^^^ ntuie ^ the 
homoloes (Fig SB) The ™JJ2* of structurally known 

ipiy. oniy 4U/ e of homoiogs with 20-25% identity 
.denmy (usmg the measure of SvT^Th^f H" 0 *"* ,0 lheir 
he number of these paus found bv thecal npOBS indici,e 
(ssearch with E-values) at l%EPO tZ £S ^ blue,Mre "">? method 
proteins with <40% identuv a „d P! h " >840I>Bdaub » ec <>»'»'ni 
structurally .dentified homo og, ?„ ^ "V*" *" ph - w 
tremety far in teaucnee and fcv" %£ " h"' .^l*™'** «' 
alignments mav be inaccurate ' denUI> Nole ,h " 

regions show that a^^ LTSS'^^^^iy-Rlltd 
25% or more identi* bm " defecnonT re 'r~ h 'P* «™ «m 
Consequently, the great seouenei ,. be, °* ^5*. 

identified evolutional ^ re.a'S™ ^T"? ° f mo,, l « u «»«»v 
panw« sequence c om £™tZ£Z* "* °' 

are detected and onlv 10% of those with is_->nc- u , 
These results show' that « a ii«f^.7 15 - 20 ^ «n be found. 

of the method j, X^£^'*™«<-«* po » n 
protein sequences. ' 8r "' d,ver 8 e n« of many 

After completion of this work, a new version of pairwise BLAST was released: BLASTGP. It supports gapped alignments, like WU-BLAST2, and dispenses with sum statistics. Our initial tests on BLASTGP using default parameters show that its E-values are reliable and that its overall detection of homologs was substantially better than that of ungapped BLAST, but not quite equal to that of WU-BLAST2.



CONCLUSION 



Table 1. 



The general consensus amonesi experts (see reft - * -« 
and (ii) „,,„„ t ,'l"T""'' " """P 1 "" »««kta 

b!° ffj S££°J£ "* E "* 1 ' 



SSEARCH % identity: within alignment
SSEARCH % identity: within both
SSEARCH % identity: HSSP-scaled
SSEARCH Smith-Waterman raw scores
SSEARCH E-values
FASTA ktup = 1 E-values
FASTA ktup = 2 E-values
WU-BLAST2 P-values
BLAST P-values

Times are from large database searches with genome proteins.
extent of errors. Second, SSEARCH, WU-BLAST2, and FASTA ktup = 1 perform best, though BLAST and FASTA ktup = 2 detect most of the relationships found by the best procedures and are appropriate for rapid initial searches.

The homologous proteins that are found by sequence comparison can be distinguished with high reliability from the huge number of unrelated pairs. However, even the best database searching procedures tested fail to find the large majority of distant evolutionary relationships at an acceptable error rate. Thus, if the procedures assessed here fail to find a reliable match, it does not imply that the sequence is unique; rather, it indicates that any relatives it might have are distant ones.

Additional and updated information on this work, including supplementary figures, may be found at http://sss.stanford.edu/sss/.

The authors are grateful to Drs. A. G. Murzin, M. Levitt, S. R. Eddy, and G. Mitchison for valuable discussion. S.E.B. was principally supported by a St. John's College (Cambridge, UK) Benefactors' Scholarship and by the American Friends of Cambridge